Road Pattern dataset description

The reference dataset used in this workshop comes from https://geo.nyu.edu/catalog/stanford-yk247bg4748. This source contains polygons of urban areas all over the world, together with population estimates. Urban areas from North and Central America and from Europe and Russia with a population of more than 200,000 people were extracted from this data source.

library(sf)
library(dplyr)
library(tidyr)
library(ggplot2)
library(gridExtra)
library(cartography)
## Loading required package: sp
set.seed(4)
cities = read_sf("./shp/Cities.shp") %>% filter(ang180!=0) %>% mutate(Cont=factor(Cont))
cont=read_sf("./shp/continent.shp")
fond = st_crop(st_transform(st_geometry(cont),3395),st_bbox(st_transform(cities,3395)))
plot(fond, col = "grey60",border = "grey20")
propSymbolsTypoLayer(st_centroid(st_transform(cities,3395)),var="Pop",var2="Cont",col=c("#e41a1c","#377eb8"),legend.var.title.txt="Population",legend.var2.title.txt="Continent              ",legend.var2.pos = "bottomleft",legend.var.pos = "n",legend.var2.frame=TRUE)
legendCirclesSymbols(pos = "topleft", 
                     var = c(200000,500000,1000000,5000000,10000000),
                     inches = 0.2, style = "e",col="gray",title.txt="Population",frame=TRUE)

These data were combined with street networks extracted from OpenStreetMap. As an example, here are the street networks extracted for Pensacola and Seville:

par(mfrow=c(1,2))
roads_app = read_sf("./UrbanAreas/Pensacola.geojson")
plot(st_geometry(roads_app))
roads_app = read_sf("./UrbanAreas/Seville.geojson")
plot(st_geometry(roads_app))
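
The orientation features used later (ang180, ang185, …) could be derived from such networks by measuring the bearing of each street segment and counting segments in 5-degree bins. Here is a minimal base-R sketch on two synthetic segments; the exact binning convention of the actual dataset is an assumption:

```r
# two toy street segments as coordinate matrices (start and end points)
segs <- list(
  rbind(c(0, 0), c(0, 1)),  # runs due north
  rbind(c(0, 0), c(1, 1))   # runs north-east
)

# bearing in degrees clockwise from north, in [0, 360)
bearing <- function(xy) {
  d <- xy[2, ] - xy[1, ]
  (atan2(d[1], d[2]) * 180 / pi) %% 360
}

angles <- sapply(segs, bearing)  # 0 and 45 degrees
# count segments in 5-degree bins, mimicking the ang180, ang185, ... columns
bins <- table(cut(angles, breaks = seq(0, 360, by = 5), right = FALSE))
sum(bins)  # one count per segment
```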

The final dataset contains information on 587 urban areas with the following features:

str(cities[,1:6])
## Classes 'sf', 'tbl_df', 'tbl' and 'data.frame':  587 obs. of  7 variables:
##  $ City    : chr  "Aarhus" "Acapulco" "Aguascalientes" "Akron" ...
##  $ Pop     : num  227100 715584 705459 524140 387743 ...
##  $ Cont    : Factor w/ 2 levels "Europe","North America": 1 2 2 2 2 2 2 2 1 2 ...
##  $ ang180  : num  1452 1421 2285 22910 1906 ...
##  $ ang185  : num  1393 1342 1832 4659 1664 ...
##  $ ang190  : num  1556 1042 1542 3772 1332 ...
##  $ geometry:sfc_MULTIPOLYGON of length 587; first list element: List of 1
##   ..$ :List of 2
##   .. ..$ : num [1:175, 1:2] 10.3 10.3 10.3 10.3 10.3 ...
##   .. ..$ : num [1:5, 1:2] 10.2 10.2 10.2 10.2 10.2 ...
##   ..- attr(*, "class")= chr  "XY" "MULTIPOLYGON" "sfg"
##  - attr(*, "sf_column")= chr "geometry"
##  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA
##   ..- attr(*, "names")= chr  "City" "Pop" "Cont" "ang180" ...

where ang180, ang185, … correspond to the histogram of street angles for each urban area. The goal of this workshop is to study the link between street network shape, or urban fabric, and continent of origin. Population density already exhibits differences between Europe and North America, as shown by the following histogram, with many more dense urban areas in Europe.

df_cities = cities %>% mutate(Area=as.numeric(st_area(geometry))) %>% mutate(Density=Pop/Area*10000)
ggplot(df_cities)+geom_histogram(aes(x=Density,fill=Cont),bins=25)+
  scale_fill_brewer(palette="Set1")+theme_bw()

If we look at the street angles, the difference between Europe and North America is also quite visible, as shown with these samples.

AnglesCounts = as.data.frame(cities)[,4:75]
df_Pa=AnglesCounts/rowSums(AnglesCounts)
levs=colnames(df_Pa)
df_Pa$City=cities$City
df_angles=gather(df_Pa,"deg","prob",-City) %>% mutate(deg=factor(deg,levels=levs)) 

exNA= cities%>% filter(Cont=="North America")  %>% sample_n(9)
ggNA=ggplot()+geom_bar(data=df_angles %>% right_join(exNA),aes(x=deg,y=prob,color=Cont,fill=Cont),stat="identity")+
  scale_y_continuous("",breaks=c())+scale_x_discrete("",breaks=c())+
  coord_polar()+facet_wrap(~City)+theme_bw()+
  scale_fill_brewer(palette="Set1",drop=FALSE,guide="none")+
  scale_color_brewer(palette="Set1",drop=FALSE,guide="none")
## Joining, by = "City"
exEu= cities%>% filter(Cont=="Europe")  %>% sample_n(9)
ggEu=ggplot()+geom_bar(data=df_angles %>% right_join(exEu),aes(x=deg,y=prob,color=Cont,fill=Cont),stat="identity")+
  scale_y_continuous("",breaks=c())+scale_x_discrete("",breaks=c())+
  coord_polar()+facet_wrap(~City)+theme_bw()+
  scale_fill_brewer(palette="Set1",drop=FALSE,guide="none")+
  scale_color_brewer(palette="Set1",drop=FALSE,guide="none")
## Joining, by = "City"
grid.arrange(ggEu,ggNA, nrow=1, ncol=2,top="")

A lot of urban areas from North America exhibit a grid pattern, which is clearly visible in the street orientation histograms.

In this workshop we will first build classifiers to recognize the region of an urban area using two variables:

- the population density
- a feature extracted from the histogram

This will enable the visualisation of classifier results for pedagogical purposes. In a second step we will tune classifiers to recognize the region of origin using the raw histograms as input. Finally, we will introduce ourselves to the keras package for deep learning by trying to classify image patches of 200x200 pixels (which correspond to 3 km square patches) using convolutional neural networks. Some examples of training patches follow:

Example training patches: Amsterdam, Columbus, Kansas City, Turin.

The two main libraries that will be used in this workshop, besides the classical dplyr, tidyr, ggplot2, …, are keras for neural networks and caret. Caret offers a unified interface to a lot of machine learning algorithms and will ease some important tasks such as hyper-parameter tuning and model evaluation.

library(caret)
library(keras)

Supervised classification on the plane

A classical first step in building a classifier is to split the dataset into a training and a testing set and to perform some preprocessing such as feature standardization (subtract the mean and divide by the standard deviation for each feature). In this first code block we compute the entropy of the road orientation histogram; this gives us a measure of the predictability of these angles (intuitively this measure will be low for grids and high for historical centers, which show a lot more variation in the angles). We then build a data frame Xp with three variables:

entropie = -rowSums(df_Pa %>% select(-City)*log2(df_Pa %>% select(-City)))
Xp = as.data.frame(df_cities) %>% select(Cont,Density) 
Xp$entropie=entropie
summary(Xp)
##             Cont        Density           entropie    
##  Europe       :311   Min.   :  3.862   Min.   :3.565  
##  North America:276   1st Qu.: 13.102   1st Qu.:5.920  
##                      Median : 23.139   Median :6.100  
##                      Mean   : 27.022   Mean   :5.989  
##                      3rd Qu.: 35.738   3rd Qu.:6.147  
##                      Max.   :128.762   Max.   :6.169
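
As a sanity check of the intuition behind this measure: a grid-like histogram concentrated in a few bins scores a low entropy, while orientations spread uniformly over the 72 bins score the maximum, log2(72) ≈ 6.17, which matches the maximum seen in the summary above. A self-contained sketch:

```r
# Shannon entropy in bits of a probability vector, with 0 * log2(0) taken as 0
entropy <- function(p) -sum(ifelse(p > 0, p * log2(p), 0))

# grid-like city: all streets fall in 2 of the 72 orientation bins
grid_hist <- c(rep(0.5, 2), rep(0, 70))
# historical center: orientations spread evenly over all bins
uniform_hist <- rep(1/72, 72)

entropy(grid_hist)     # 1 bit: very predictable orientations
entropy(uniform_hist)  # log2(72), about 6.17: maximally unpredictable
```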

The continent of the urban area (North America, Europe) is the target feature, and the other two are the predictors. We may then use the createDataPartition and preProcess functions to split and normalize this dataset (normalization is important here for several methods since the raw features have quite different ranges). Note that the mean and standard deviation used for scaling are computed on the training dataset only.

intrain  = createDataPartition(Xp$Cont, list=FALSE,p = 0.75)
scaler = preProcess(Xp[intrain,-1])
Xp.train = predict(scaler,Xp[intrain,-1])
Xp.train$Cont=Xp$Cont[intrain]
Xp.test  = predict(scaler,Xp[-intrain,-1])
Xp.test$Cont=Xp$Cont[-intrain]

We may get a first picture of this simple dataset using a scatter plot. Urban areas from North America exhibit a lower density and a higher entropy. These variables may therefore be helpful to distinguish the two classes. We will start by running two classical classifiers, namely logistic regression and k-nn.

ggplot(Xp.train)+geom_point(aes(x=Density,y=entropie,color=Cont),size=3)+
  scale_color_brewer(palette="Set1",drop=FALSE)+
  coord_equal()+theme_bw()

GLM

We start with a simple logistic regression, which can easily be estimated in R thanks to the glm function. The target feature must be a factor with two levels and the family argument must be set to binomial. The predict function returns log-odds by default, but the type="response" argument can be used to get probabilities. These probabilities can then be used to reconstruct the most probable decision and compute a confusion matrix on the training set by comparing the true labels with the predicted ones.

fit = glm(Cont~.,data=Xp.train,family = binomial)
classes = levels(Xp.train$Cont)
predlab.train = factor(classes[(predict(fit,Xp.train,type="response")>0.5)+1],levels = classes)
mc.train = confusionMatrix(predlab.train,Xp.train$Cont)
mc.train
## Confusion Matrix and Statistics
## 
##                Reference
## Prediction      Europe North America
##   Europe           211            84
##   North America     23           123
##                                           
##                Accuracy : 0.7574          
##                  95% CI : (0.7146, 0.7967)
##     No Information Rate : 0.5306          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5045          
##  Mcnemar's Test P-Value : 6.615e-09       
##                                           
##             Sensitivity : 0.9017          
##             Specificity : 0.5942          
##          Pos Pred Value : 0.7153          
##          Neg Pred Value : 0.8425          
##              Prevalence : 0.5306          
##          Detection Rate : 0.4785          
##    Detection Prevalence : 0.6689          
##       Balanced Accuracy : 0.7480          
##                                           
##        'Positive' Class : Europe          
## 

The confusionMatrix function from the caret package gives a lot of statistics, derived from the confusion matrix, that are commonly used to assess the performance of a classifier. Let's focus on accuracy for the moment, which is close to 76%. Here we used the training set to compute this statistic, and this estimate of our classifier's performance could be optimistic in case of over-fitting. We may use our testing set to get a more robust estimate of the performance:
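
As a quick check, the headline accuracy (and, for example, the sensitivity) can be recomputed by hand from the confusion matrix above: accuracy is simply the share of correctly classified samples, found on the diagonal of the matrix.

```r
# confusion matrix from the GLM fit above (rows: predicted, cols: reference)
mc <- matrix(c(211, 23, 84, 123), nrow = 2,
             dimnames = list(Prediction = c("Europe", "North America"),
                             Reference  = c("Europe", "North America")))

# accuracy: correct decisions (diagonal) over all decisions
accuracy <- sum(diag(mc)) / sum(mc)
round(accuracy, 4)  # 0.7574, matching the caret output

# sensitivity: correctly predicted Europe over all true Europe samples
round(mc["Europe", "Europe"] / sum(mc[, "Europe"]), 4)  # 0.9017
```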

predlab.test = factor(classes[(predict(fit,Xp.test,type="response")>0.5)+1],levels = classes)
mc.test = confusionMatrix(predlab.test,Xp.test$Cont)
mc.test$overall[1]
##  Accuracy 
## 0.7534247

We get an accuracy close to the one obtained on the training set, but a little bit lower. This is logical since logistic regression with few features is not prone to over-fitting (only 3 parameters are estimated here). We may plot the decision function of our classifier to visualize how simple it is.

# build a grid 
Xpgrid = expand.grid(
  Density  = seq(min(Xp.train$Density), max(Xp.train$Density), length.out = 100),
  entropie = seq(min(Xp.train$entropie), max(Xp.train$entropie), length.out = 100)
)
# use the classifier to get the prediction on the grid
proba = predict(fit,Xpgrid,type="response")
# reverse the probabilities to get probabilities of "North America"
Xpgrid$proba=1-proba
# draw the probabilities and the training set samples
ggplot(Xpgrid)+geom_tile(aes(x=Density,y=entropie,fill=proba),alpha=0.4)+
  scale_fill_distiller(palette="RdBu")+
  geom_point(data=Xp.train,aes(x=Density,y=entropie,shape=Cont,color=Cont),size=2)+
  scale_color_brewer(palette="Set1",drop=FALSE)+
  coord_equal()+theme_bw()

As expected, the decision boundary is linear with respect to the features. Let's move on to another classic classifier based on nearest neighbours.

K-nn

K-nn uses the nearest neighbours of a point to take a decision based on majority voting. We chose to analyze this classifier since it introduces a "hyper-parameter": the number of neighbours taken into account. This hyper-parameter must be set in order to get a final classifier, and with this model we will take our first steps in hyper-parameter tuning. To begin, let's start with the simplest nearest neighbour classifier, which only takes one neighbour into account. We may fit the classifier and estimate its performance on the training set with these commands:
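
To make the majority-voting rule concrete, here is a from-scratch k-nn classifier on toy 2-d points (the data and function names are purely illustrative, not part of the workshop dataset):

```r
# toy training set: two well-separated classes in the plane
X <- rbind(c(0, 0), c(0, 1), c(1, 0), c(5, 5), c(5, 6), c(6, 5))
y <- factor(c("A", "A", "A", "B", "B", "B"))

# classify one query point by majority vote among its k nearest neighbours
knn_predict <- function(query, X, y, k) {
  d <- sqrt(rowSums(t(t(X) - query)^2))  # Euclidean distances to all points
  votes <- y[order(d)[1:k]]              # labels of the k closest points
  names(which.max(table(votes)))         # majority class
}

knn_predict(c(0.5, 0.5), X, y, k = 3)  # "A"
knn_predict(c(5.5, 5.5), X, y, k = 3)  # "B"
```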

fit.knn=knn3(Cont~.,data=Xp.train,k=1)
predlab.train = predict(fit.knn,Xp.train,type="class")
mc.train = confusionMatrix(predlab.train,Xp.train$Cont)
mc.train$overall[1]
## Accuracy 
##        1

Whoah!! A perfect fit :) … or perhaps not such good news:

predlab.test = predict(fit.knn,Xp.test,type="class")
mc.test = confusionMatrix(predlab.test,Xp.test$Cont)
mc.test$overall[1]
##  Accuracy 
## 0.8082192

Results on the test set are a lot less convincing. This is natural since k-nn works through memorization of the training set: with k=1, each training point is its own nearest neighbour, so the training error is necessarily zero. Comparing the decision functions for k=1 and k=7 shows how increasing k smooths the decision boundary; we then use repeated cross-validation to pick k.

fit.knn=knn3(Cont~.,data=Xp.train,k=1)
proba = predict(fit.knn,Xpgrid)
Xpgrid$proba=proba[,1]

gg=  list()
gg[[1]] = ggplot(Xpgrid)+geom_tile(aes(x=Density,y=entropie,fill=proba),alpha=0.4)+
  scale_fill_distiller(palette="RdBu",guide="none")+
  geom_point(data=Xp.train,aes(x=Density,y=entropie,shape=Cont,color=Cont),size=2)+
  coord_equal()+scale_color_brewer(palette="Set1",drop=FALSE,guide="none")+scale_shape(guide="none")+theme_bw()
  
fit.knn=knn3(Cont~.,data=Xp.train,k=7)
proba = predict(fit.knn,Xpgrid)
Xpgrid$proba=proba[,1]

gg[[2]]  = ggplot(Xpgrid)+geom_tile(aes(x=Density,y=entropie,fill=proba),alpha=0.4)+
  scale_fill_distiller(palette="RdBu",guide="none")+
  geom_point(data=Xp.train,aes(x=Density,y=entropie,shape=Cont,color=Cont),size=2)+
  coord_equal()+scale_color_brewer(palette="Set1",drop=FALSE,guide="none")+scale_shape(guide="none")+theme_bw()
marrangeGrob(gg,nrow=1,ncol=2,top="Visual comparison of k=1 and k=7 decision function")

grid = expand.grid(k=seq(1,14,2))
control = trainControl(method="repeatedcv", number=10,repeats = 5)
fitcv <- train(Cont~., data=Xp.train, method="knn", metric="Accuracy", trControl=control,tuneGrid=grid)
# display results
ggplot(fitcv$results)+geom_line(aes(x=k,y=Accuracy))+
  geom_point(aes(x=k,y=Accuracy),fill="white",size=3,shape=21)

# evaluate the k=7 classifier fitted above on the test set
predlab.test = predict(fit.knn,Xp.test,type="class")
mc.test = confusionMatrix(predlab.test,Xp.test$Cont)
mc.test$overall[1]
##  Accuracy 
## 0.8356164

Classification with raw features

We now use the raw angle histograms, normalized by the surface of each urban area, as predictors:

cities = read_sf("../dynamit/shp/Cities2.shp") %>% filter(ang180!=0) %>% mutate(Cont=factor(Cont))

Xa = cities %>% st_set_geometry(NULL) %>% select(c(2:39,76:80))
# normalize the angle counts by the surface of each urban area
Xan = data.frame(Xa %>% select(-Cont)/Xa$Area) %>% select(-Area)
Xan$Cont = Xa$Cont

library(caret)
intrain  = createDataPartition(Xan$Cont, list=FALSE,p = 0.75)
scaler= preProcess(Xan[intrain,])
Xa.train = predict(scaler,Xan[intrain,])
Xa.test  = predict(scaler,Xan[-intrain,])  

Questions

Use caret to fit a CART tree and tune its regularization parameter. More information, including method names and hyper-parameter names, can be found in caret's help: https://topepo.github.io/caret/train-models-by-tag.html. Visualize the decision of your classifier in the feature space and compute the model's test accuracy.

http://www.r2d3.us/visual-intro-to-machine-learning-part-1/

http://www.r2d3.us/visual-intro-to-machine-learning-part-2/

grid = expand.grid(maxdepth=1:20)
control = trainControl(method="repeatedcv", number=10,repeats = 5)
fit.tree <- train(Cont~., data=Xa.train, method="rpart2", metric="Accuracy", tuneGrid=grid,trControl=control)
# display results
ggplot(fit.tree$results)+geom_line(aes(x=maxdepth,y=Accuracy))+geom_linerange(aes(x=maxdepth,ymin=Accuracy-AccuracySD,ymax=Accuracy+AccuracySD))+scale_y_continuous(limits=c(0.5,1))

predlab.test = predict(fit.tree$finalModel,Xa.test,type="class")
mc.test = confusionMatrix(predlab.test,Xa.test$Cont)
mc.test$overall[1]
##  Accuracy 
## 0.8630137
gbmGrid =  expand.grid(interaction.depth = seq(5,20,5), 
                        n.trees = seq(50,1000,50), 
                        shrinkage = 0.1,
                        n.minobsinnode = 5)
xgbGrid = expand.grid(eta = 0.1, 
                        colsample_bytree=c(0.7,0.8,0.9),
                        max_depth=c(6,9,12),
                        nrounds=seq(10,400,10),
                        gamma=1,
                        min_child_weight=5,
                        subsample=0.7)
control = trainControl(method="repeatedcv", number=10,repeats = 5)
fit <- train(Cont~., data=Xa.train, method="xgbTree", metric="Accuracy", trControl=control,tuneGrid=xgbGrid,verbose=FALSE)
# display results
ggplot(fit)

predlab.test = predict(fit,Xa.test)
mc.test = confusionMatrix(predlab.test,Xa.test$Cont)
mc.test$overall[1]
##  Accuracy 
## 0.8082192

ConvNN and keras on image patches

This last part introduces the keras package through convolutional neural networks: we will try to recognize the continent of origin directly from 200x200 pixel image patches of the street networks.

library(abind)
library(caret)
# path to image folders
train_image_files_path <- "./ImgTrain/"
val_image_files_path <- "./ImgVal/"
img_width = 200
img_height = 200
cls = c("Europe","NorthAmerica")
target_size = c(img_width, img_height)
data_gen = image_data_generator(
  rescale = 1/255
)


# training images
train_image_array_gen <- flow_images_from_directory(train_image_files_path, 
                                                    data_gen,
                                                    target_size = target_size,
                                                    class_mode = "categorical",
                                                    classes = cls,
                                                    color_mode = "grayscale",
                                                    seed = 42)

# validation images
val_image_array_gen <- flow_images_from_directory(val_image_files_path, 
                                                    data_gen,
                                                    target_size = target_size,
                                                    class_mode = "categorical",
                                                    classes = cls,
                                                    color_mode = "grayscale",
                                                    seed = 42)
# define batch size and number of epochs
batch_size <- 100
epochs <- 20


# initialise model
model <- keras_model_sequential()

# add layers
model %>%
  layer_conv_2d(filter = 15, kernel_size = c(5,5), padding = "same", input_shape = c(img_width, img_height,1)) %>%
  layer_activation("relu") %>%

  # Use max pooling
  layer_max_pooling_2d(pool_size = c(5,5)) %>%

  layer_conv_2d(filter = 15, kernel_size = c(5,5), padding = "same", input_shape = c(img_width, img_height,1)) %>%
  layer_activation("relu") %>%
  
  # Use max pooling
  layer_max_pooling_2d(pool_size = c(5,5)) %>%
  
  # Flatten max filtered output into feature vector 
  # and feed into dense layer
  layer_flatten() %>%
  layer_dense(20) %>%
  layer_activation("relu") %>%

  
  # Outputs from dense layer are projected onto output layer
  layer_dense(2) %>% 
  layer_activation("softmax")
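
Since both convolutions use "same" padding, only the two 5x5 max-pooling layers shrink the feature maps: 200 -> 40 -> 8 pixels per side, so the flattened vector feeding the dense layer has 8 x 8 x 15 = 960 entries. A quick check of this arithmetic (plain R, independent of keras):

```r
side <- 200      # input patch width/height in pixels
side <- side / 5 # after the first 5x5 max pooling: 40
side <- side / 5 # after the second 5x5 max pooling: 8
n_features <- side * side * 15  # 15 filters in the last conv layer
n_features  # length of the flattened feature vector: 960
```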
# compile
model %>% compile(
  loss = "categorical_crossentropy",
  optimizer = optimizer_adam(),
  metrics = "accuracy"
)
# fit
# number of training samples
train_samples <- train_image_array_gen$n
# number of validation samples
val_samples <- val_image_array_gen$n

hist <- model %>% fit_generator(
  # training data
  train_image_array_gen,
  
  # epochs
  steps_per_epoch = as.integer(train_samples/batch_size), 
  epochs = epochs, 
  
  # validation data
  validation_data = val_image_array_gen,
  validation_steps = as.integer(val_samples/batch_size)
  
)
# Testing 

image_paths = list.files("./ImgTest", recursive = FALSE, full.names = TRUE)
image_preprocessor = function(image_path) {
  image = image_load(image_path, target_size = c(200,200),grayscale=TRUE)
  image = image_to_array(image)
  image = array_reshape(image, c(1, dim(image)))
  image = image/255
  return(image)
}

label_extract = function(image_path) {
  classes = c("Europe","NorthAmerica")
  image_path=gsub("./ImgTest/","",image_path)
  classes[as.numeric(substr(image_path,1,1))]
}

sa=sample(length(image_paths),1000) 
image_list = lapply(image_paths[sa], image_preprocessor)
test_images=abind(image_list,along=1)

test_labels=factor(sapply(image_paths[sa],label_extract),cls)
pred_prob=model %>% predict_proba(test_images)
pred_cl=factor(cls[apply(pred_prob,1,which.max)],cls)
confusionMatrix(pred_cl,test_labels)
## Confusion Matrix and Statistics
## 
##               Reference
## Prediction     Europe NorthAmerica
##   Europe          419          119
##   NorthAmerica     61          401
##                                           
##                Accuracy : 0.82            
##                  95% CI : (0.7948, 0.8433)
##     No Information Rate : 0.52            
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6411          
##  Mcnemar's Test P-Value : 2.152e-05       
##                                           
##             Sensitivity : 0.8729          
##             Specificity : 0.7712          
##          Pos Pred Value : 0.7788          
##          Neg Pred Value : 0.8680          
##              Prevalence : 0.4800          
##          Detection Rate : 0.4190          
##    Detection Prevalence : 0.5380          
##       Balanced Accuracy : 0.8220          
##                                           
##        'Positive' Class : Europe          
## 
im2gg = function(im){
  im.df=data.frame(im)
  names(im.df)=1:nrow(im)
  im.df$row=1:nrow(im)
  im.df.long=gather(im.df,"col","gray",-row,convert=TRUE)
}

p=list()
ex_test = sample(1000,25)
ip=1
for (i in ex_test){
  p[[ip]]=ggplot(im2gg(test_images[i,,,]))+geom_tile(aes(x=row,y=col,fill=gray))+
    coord_equal()+
    theme(plot.title = element_text(color=ifelse(test_labels[i]==pred_cl[i],"#008000","#800000"), size=8, face="bold.italic"),
          axis.line=element_blank(),
          axis.text.x=element_blank(),
          axis.text.y=element_blank(),
          axis.ticks=element_blank(),
          axis.title.x=element_blank(),
          axis.title.y=element_blank(),
          legend.position="none",
          panel.background=element_blank(),
          panel.border=element_blank(),
          panel.grid.major=element_blank(),
          panel.grid.minor=element_blank(),
          plot.background=element_blank(),
          panel.spacing = unit(c(0, 0, 0, 0), "cm"), 
          plot.margin = unit(c(0, 0, 0, 0), "cm"))+
    scale_fill_continuous(guide="none",low="black",high="white")+
    ggtitle(paste0(test_labels[i],"/",pred_cl[i],": ",round(max(pred_prob[i,])*100),"%"))
  ip=ip+1
}

ml <- marrangeGrob(p, nrow=5, ncol=5,top="Test set sample results")
ml